Public Interest::Data Ethics & Practice
You can import almost any data format into R if you know the
right function (and package):
Last week we read in a csv file saved in the data folder within our
project folder using the (tidyverse/readr) function
read_csv("filepath/filename"). That is, we read the data
locally (from a local device).
read_csv() will guess the variable type of each column. If we want
more control over how the data is read in, we can tell R the variable
type for each column with the col_types argument, e.g.,
col_types = "cfin", where each letter is shorthand for a variable type
(c = character, f = factor, i = integer, n = number), entered in the
order of the columns/variables we want the format applied to.
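As a minimal sketch of the col_types shorthand, the example below writes a tiny csv to a temporary file and reads it back with explicit column types. The file contents and the column names (name, region, visits, score) are made up for illustration; only read_csv() and col_types come from the notes above.

```r
library(readr)

# Write a tiny csv to a temporary file so the example is self-contained.
# (Column names and values are invented for illustration.)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,region,visits,score",
    "Ana,north,3,4.5",
    "Ben,south,10,2.25",
    "Cy,north,7,3"),
  csv_path
)

# "cfin" = character, factor, integer, number: one letter per column, in order.
df <- read_csv(csv_path, col_types = "cfin")

# The long form names each column explicitly and parses the same way.
df_long <- read_csv(
  csv_path,
  col_types = cols(
    name   = col_character(),
    region = col_factor(),
    visits = col_integer(),
    score  = col_number()
  )
)

# Inspect the resulting column types.
str(df)
```

The one-letter string is quick to type, while the cols() form is easier to read when a file has many columns.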
We can also read csv files directly from a URL, without first saving the file to the computer. When the data can be read from a URL, the script is more reproducible by others (assuming the link doesn’t disappear!).
read_csv("url")
csv files are simple text files, which makes them easy to read.
Data is often stored and shared in Excel files, which are harder to
read. The readxl package makes this easier.
read_excel("filepath/filename.xlsx")
- sheet = 1 to read in the first sheet; if sheets are named, you can call them by name
- skip = 1 to skip the first row; Excel spreadsheets are often written for humans to read rather than computers, so extra header information is more common
- range = "B3:C100" to read in only the values in the identified range of cells

Excel files cannot be read in directly from a URL with read_excel(). They must first be downloaded to your computer and read in locally. (You can also download csv files first and read them in locally.)
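As a quick illustration of the read_excel() arguments above, the sketch below uses the example workbooks that ship with readxl (located with readxl_example()) instead of a file from the course. Choosing these workbooks is an assumption made so the code runs anywhere readxl is installed.

```r
library(readxl)

# readxl bundles example workbooks; readxl_example() returns a local path.
datasets_path <- readxl_example("datasets.xlsx")

# List the sheet names so that sheets can be called by name.
sheet_names <- excel_sheets(datasets_path)

# sheet = 1 reads the first sheet; a sheet name works as well.
first_sheet  <- read_excel(datasets_path, sheet = 1)
mtcars_sheet <- read_excel(datasets_path, sheet = "mtcars")

# range takes a quoted cell range: here the header row plus four data rows
# from the first three columns of the first sheet.
top_left <- read_excel(datasets_path, sheet = 1, range = "A1:C5")

# skip drops leading rows written for human readers. The bundled deaths.xlsx
# workbook has notes above its real header row, so those rows are skipped.
deaths <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
```

When a spreadsheet also has notes below the data, range is usually the safer choice because it bounds the rectangle on all sides.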
download.file("url", "destfolder/filename.xlsx")
If an Excel file has a download link, it is more reproducible to download the file via the script. To ensure such a script works for anyone, you can include a code snippet that creates the folder to download to, e.g.,
if (!dir.exists("destfolder")) {
  dir.create("destfolder")
}
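The folder check and the download can be combined into one small helper, sketched below. The function name download_if_missing and its arguments are hypothetical (it is not part of base R or any package); it only wraps dir.exists(), dir.create(), file.exists(), and download.file().

```r
# Download a file into dest_dir unless it is already there.
# `download_if_missing` is a hypothetical helper, not a base R function.
download_if_missing <- function(url, dest_dir, filename) {
  # Create the destination folder if it does not exist yet.
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  dest_path <- file.path(dest_dir, filename)
  # Skip the download when the file is already present.
  if (!file.exists(dest_path)) {
    # mode = "wb" keeps binary files such as .xlsx intact on Windows.
    download.file(url, dest_path, mode = "wb")
  }
  # Return the local path so it can be passed to a read function.
  dest_path
}

# Usage (placeholder URL and names, as in the text above):
# path <- download_if_missing("url", "destfolder", "filename.xlsx")
# readxl::read_excel(path)
```

Skipping the download when the file already exists keeps reruns fast, at the cost of not picking up changes to the remote file.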
desc()
Summarize according to a summary function.
Summary functions include
| Summary Functions | |
|---|---|
| first(): first value | sum(): sum of values |
| last(): last value | n(): number of values |
| nth(.x, n): nth value | n_distinct(): number of distinct values |
| min(): minimum value | mean(): mean value |
| max(): maximum value | var(): variance |
| median(): median value | sd(): standard deviation |
| quantile(.x, probs = .25): quantile of values (here, the 25th percentile) | *IQR(): interquartile range |
Things to note:
summarize() is especially helpful when combined with group_by().
group_by
Aggregate/group by value(s) of column(s).
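To show how the two fit together, here is a small sketch with an invented data frame. The names visits, clinic, and wait_minutes are made up for illustration; the summary functions are from the table above.

```r
library(dplyr)

# Small made-up data frame for illustration.
visits <- tibble(
  clinic       = c("A", "A", "A", "B", "B"),
  wait_minutes = c(10, 20, 30, 5, 15)
)

# Without group_by(): one summary row for the whole data frame.
overall <- visits %>%
  summarize(n_visits = n(), mean_wait = mean(wait_minutes))

# With group_by(): one summary row per clinic.
by_clinic <- visits %>%
  group_by(clinic) %>%
  summarize(
    n_visits     = n(),
    mean_wait    = mean(wait_minutes),
    longest_wait = max(wait_minutes),
    first_wait   = first(wait_minutes)
  )
```

The same summarize() call returns one row per group once the data frame has been grouped, which is why the two functions are so often used together.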
mutate
Create new columns or alter existing columns.
if_else, case_when
df <- df %>%
  mutate(newvar = if_else(condition, value_if_true, value_if_false, value_if_na))
df <- df %>%
  mutate(newvar = case_when(
    condition1 ~ value1,
    condition2 ~ value2,
    condition3 ~ value3,
    TRUE ~ value_everything_else
  ))
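The templates above can be filled in with concrete values, as in this sketch. The age column and the labels are invented for illustration.

```r
library(dplyr)

# Made-up data with one missing value.
df <- tibble(age = c(15, 34, 70, NA))

df <- df %>%
  mutate(
    # if_else(condition, true, false, missing): the fourth argument is the
    # value used when the condition evaluates to NA.
    adult = if_else(age >= 18, "adult", "minor", "unknown"),
    # case_when() checks the conditions in order and uses the first one
    # that is TRUE; TRUE ~ ... catches everything left over.
    age_group = case_when(
      is.na(age) ~ "unknown",
      age < 18   ~ "under 18",
      age < 65   ~ "18 to 64",
      TRUE       ~ "65 and over"
    )
  )
```

Because case_when() stops at the first matching condition, the order of the conditions matters: the NA check comes first so that missing ages do not fall through to the catch-all.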
across() can also be used within mutate.
First go to Slack and copy the practice script for today (week2script.R) into your weeklymaterials/scripts folder from last week. Then open an RStudio session using the weeklymaterials.Rproj file.